Open In Colab

Libraries and utility functions

Load data

Data pre-processing

Structure Exploration

The Name feature is not necessary, so we are going to drop it. The # feature will also be dropped, because pandas already provides an index. The Type 1 and Type 2 features are categorical, so we are going to encode them and use them for classification.
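A minimal sketch of the dropping step, using a toy frame whose column names are assumed to mirror the Pokémon dataset's layout:

```python
import pandas as pd

# Hypothetical mini-frame with the same column names as the dataset.
df = pd.DataFrame({
    "#": [1, 2],
    "Name": ["Bulbasaur", "Ivysaur"],
    "Type 1": ["Grass", "Grass"],
    "HP": [45, 60],
})

# "#" duplicates the pandas index and "Name" is a unique identifier,
# so neither carries predictive signal.
df = df.drop(columns=["#", "Name"])
```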

Dropping unnecessary features

Missing values detection

Before we apply any ML models, we need to examine whether there are missing values in the data.

There are missing values in our dataset. The strategy we are going to use is one of the simplest, i.e. mode imputation (since the missing data are categorical). Its drawbacks are not discussed any further here, but in the general case a more sophisticated approach should be considered, such as building probabilistic models to handle the missing data.
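A sketch of mode imputation on a toy column (the column name Type 2 is assumed, since that is where missing types typically occur in this dataset):

```python
import pandas as pd

# Toy categorical column with a missing entry.
df = pd.DataFrame({"Type 2": ["Poison", None, "Poison", "Flying"]})

# Mode imputation: replace missing entries with the most frequent category.
mode = df["Type 2"].mode()[0]
df["Type 2"] = df["Type 2"].fillna(mode)
```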

**Conclusion**: After imputation, there are no features with missing values in the dataset.

Encoding categorical features

In order to apply the supervised learning algorithms discussed in the Machine Learning course, we need to encode the categorical features. For this purpose, sklearn.preprocessing.LabelEncoder is going to be leveraged.

**Conclusion**: All of the data are numerical and can be represented within vector spaces.

Exploratory Data Analysis (EDA)

Distribution of the features

Pair-plots

Between feature dependance

Visualization with reduced dimensionality

PCA approach for dimensionality reduction is going to be performed before the data are visualized in 3D space.
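A sketch of the reduction step on a random stand-in matrix (the real feature matrix would replace X; the 3D scatter itself can then be drawn with mpl_toolkits.mplot3d):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the numeric feature matrix of the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))

# Project onto the first three principal components (PC1, PC2, PC3).
pca = PCA(n_components=3)
X3 = pca.fit_transform(X3 := X) if False else PCA(n_components=3).fit_transform(X)
```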

From the visualization we can clearly see a boundary for Legendary between 100 and 200 on the PC1 axis, parallel to the PC3 axis, but there is some overlap, which can be due to noise, to information lost by PCA, or to the non-linearity of the problem.

No clear boundary between the Generation classes can be seen in this visualization; the cause can be the data themselves, or simply PCA not capturing enough information.

Classification for Legendary

Before we apply any ML models, we need to split the data into features X and target y, and into training and test sets. We are going to use sklearn.model_selection.train_test_split, which performs a random split (sampling without replacement). We are going to use 20% of the dataset for testing and 80% for training the models.
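A minimal sketch of the split on toy arrays (the real X and y would be the encoded feature matrix and the Legendary column):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, binary target.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# 80/20 random split; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```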

Naïve Bayes

Model Training

Model Evaluation

Linear Discriminant Analysis (LDA)

Model Training

Model Evaluation

Quadratic Discriminant Analysis (QDA)

Model Training

Model Evaluation

Models summary and conclusion

NB provides the best accuracy, F1, and precision; QDA performs better than LDA overall, with LDA scoring only slightly below QDA. In conclusion, the conditional independence assumption on the features made by NB yields the best results in this setting.
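The three models above share the same fit/predict interface, so the comparison can be sketched in one loop (synthetic data stands in for the Pokémon features; the scores here do not reproduce the notebook's results):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Pokémon features and Legendary target.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for name, model in [
    ("NB", GaussianNB()),
    ("LDA", LinearDiscriminantAnalysis()),
    ("QDA", QuadraticDiscriminantAnalysis()),
]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = accuracy_score(y_te, pred)
```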

Classification for Generation

Before we apply any ML models, we need to split the data into features X and target y, and into training and test sets. As before, we are going to use sklearn.model_selection.train_test_split, which performs a random split (sampling without replacement), with 20% of the dataset held out for testing and 80% used for training the models.

Naïve Bayes

Model Training

Model Evaluation

Linear Discriminant Analysis (LDA)

Model Training

Model Evaluation

Quadratic Discriminant Analysis (QDA)

Model Training

Model Evaluation

Conclusion

Due to the inseparability of the classes, all of the models perform poorly: NB has an accuracy of 0.02, LDA 0.21, and QDA dominates with 0.27. In conclusion, other models should be considered that can better capture the class overlap and the overall structure of the problem.